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Abstract. Revelations of large scale electronic surveillance and data 
mining by governments and corporations have fueled increased adop- 
tion of HTTPS. We present a traffic analysis attack against over 6000 
webpages spanning the HTTPS deployments of 10 widely used, industry- 
leading websites in areas such as healthcare, finance, legal services and 
streaming video. Our attack identifies individual pages in the same web- 
site with 89% accuracy, exposing personal details including medical con- 
ditions, financial and legal affairs and sexual orientation. We examine 
evaluation methodology and reveal accuracy variations as large as 18% 
caused by assumptions affecting caching and cookies. We present a novel 
defense reducing attack accuracy to 27% with a 9% traffic increase, and 
demonstrate significantly increased effectiveness of prior defenses in our 
evaluation context, inclusive of enabled caching, user-specific cookies and 
pages within the same website. 

1 Introduction 

HTTPS is far more vulnerable to traffic analysis than has been previously dis- 
cussed by researchers. In a series of important papers, a variety of researchers 
have shown a number of traffic analysis attacks on SSL proxies |H2j . SSH tun- 
nels |3l4l5l6l7j . Tor |3l4l8l9j . and in unpublished work, HTTPS [TOUT] . Together, 
these results suggest that HTTPS may be vulnerable to traffic analysis. This pa- 
per confirms the vulnerability of HTTPS, but more importantly, gives new and 
much sharper attacks on HTTPS, presenting algorithms that decrease errors 
3.6x from the best previous techniques. We show the following novel results: 

— Novel attack technique capable of achieving 89% accuracy over 500 pages 
hosted at the same website, as compared to 60% with previous techniques 

— Impact of caching and cookies on traffic characteristics and attack perfor- 
mance, affecting accuracy as much as 18% 

— Novel defense reducing accuracy to 27% with 9% traffic increase; significantly 
increased effectiveness of packet level defenses in the HTTPS context 

We evaluate attack, defense and measurement techniques on websites for health- 
care (Mayo Clinic, Planned Parenthood, Kaiser Permanente), finance (Wells 
Fargo, Bank of America, Vanguard), legal services (ACLU, Legal Zoom) and 
streaming video (Netflix, YouTube). 



We design our attack to distinguish minor variations in HTTPS traffic from 
significant variations which indicate distinct traffic contents. Minor traffic varia- 
tions may be caused by caching, dynamically generated content, or user-specific 
content including cookies. Our attack applies clustering techniques to identify 
patterns in traffic. We then use a Gaussian distribution to determine similarity 
to each cluster and map traffic samples into a fixed width representation com- 
patible with a wide range of machine learning techniques. Due to similarity with 
the Bag-of- Words approach to document classification, we refer to our technique 
as Bag-of-Gaussians (BoG). This approach allows us to identify specific pages 
within a website, even when the pages have similar structures and shared re- 
sources. After initial classification, we apply a hidden Markov model (HMM) to 
leverage the link structure of the website and further increase accuracy. We show 
our approach achieves substantially greater accuracy than attacks developed by 
Panchenko et al. (Pan) [5], Liberatore and Levine (LL) [BJ, and Wang et al. [9]. 

We also present a novel defense technique and evaluate several previously 
proposed defenses. We consider deployability both in the design of our tech- 
nique and the selection of previous techniques. Whereas the previous, and less 
effective, techniques could be implemented as stateless packet filters, our tech- 
nique operates statelessly at the granularity of individual HTTP requests and 
responses. Our evaluation demonstrates that some techniques which are ineffec- 
tive in other traffic analysis contexts have significantly increased impact in the 
HTTPS context. For example, although Dyer et al. report exponential padding 
as only decreasing accuracy of the Panchenko classifier from 97.2% to 96.6%, we 
observe a decrease from 60% to 22% [S] . Our novel defense reduces the accuracy 
of the BoG attack from 89% to 27% while generating only 9% traffic overhead. 

We conduct our evaluations using a dataset of 463,125 page loads collected 
from 10 websites during December 2013 and January 2014. Our collection in- 
frastructure includes virtual machines (VMs) which operate in four separate 
collection modes, varying properties such as caching and cookie retention across 
the collection modes. By training a model using data from a specific collection 
mode and evaluating the model using a different collection mode, we are able to 
isolate the impact of factors such as caching and user-specific cookies on anal- 
ysis results. We present these results along with insights into the fundamental 
properties of the traffic itself. 

Section [2] presents the risks posed by HTTPS traffic analysis and adversaries 
who may be motivated and capable to conduct attacks. Section [3] reviews prior 
work, and in section [4] we present the core components of our attack. Section [5] 
presents the impact of evaluation conditions on reported attack accuracy, sec- 
tion [6] evaluates our attack, and section [7] presents and evaluates defense tech- 
niques. In Section [8] we discuss results and conclude. 

2 Risks of HTTPS Traffic Analysis 

This section presents an overview of the potential risks and attackers we consider 
in analyzing HTTPS traffic analysis attacks. Section [2~l"1 describes four categories 



of content, each of which we explore in this work, and potential consequences of 
a privacy violation in each category. Section |2.2| discusses adversaries who may 
be motivated and capable to conduct the attacks discussed in section [2~T] 

2.1 Privacy Applications of HTTPS 

We present several categories of website in which the specific pages accessed by 
the user are more interesting than the mere fact that the user is visiting the web- 
site at all. This notion is present in traditional privacy concepts such as patient 
confidentiality or attorney-client privilege, where the content of a communication 
is substantially more sensitive than the simple presence of communication. 

Healthcare Many medical conditions or procedures are associated with sig- 
nificant social stigma. We examine the websites of Planned Parenthood, Mayo 
Clinic and Kaiser Permanente, a healthcare provider serving 9 million members 
in the US. The page views of these websites have the potential to reveal whether 
a pending procedure is an appendectomy or an abortion, or whether a chronic 
medication is for diabetes or HIV/ AIDS. These types of distinctions and oth- 
ers can form the basis for discrimination or persecution and represent an easy 
opportunity to target advertising for products which consumers are highly mo- 
tivated to purchase. Beyond personal risks, the health care details of corporate 
and political leaders can also have significant financial implications, as evidenced 
by Apple stock fluctuations in response to reports, both true and false, of Steve 
Jobs's health [P2] . 

Legal There are many common reasons for interaction with a lawyer, such 
as completing a will, filing taxes, or reviewing a contract. However, contacting a 
lawyer to investigate divorce, bankruptcy, or legal options as an undocumented 
immigrant may attract greater interest. Since some legal advice is relatively 
unremarkable while other advice may require strict privacy, the specific details 
of legal services are more interesting than mere interaction with a lawyer. Our 
work examines LegalZoom, a website offering legal services spanning the above 
themes and others. We additionally examine the American Civil Liberties Union 
(ACLU), which offers legal information and actively litigates on a wide range of 
sensitive topics including LGBT rights, human reproduction and immigration. 

Financial While most consumers utilize some form of financial products 
to manage their personal finances, the exact products a person uses reveal a 
great deal more about their personal circumstances. For example, a user with 
educational savings accounts likely has children, a joint account is an indicator 
of a long term relationship, and high volume mutual funds offering reduced fees 
likely indicate high levels of minimum net worth. Our work examines Bank of 
America and Wells Fargo, both large banks in the US, as well as Vanguard, a 
firm offering a range of investment vehicles and brokerage services. 

Streaming Video As demonstrated during the Netflix Prize contest and 
ensuing $9 million settlement, the video rental history of an individual can po- 
tentially reveal information as personal as sexual orientation [13114] . Beyond 
any guarantees given in privacy policies, video rentals in the US are additionally 



protected by law [15]. We examine YouTube and Netflix, both of which offer 
streaming videos covering a wide range of topics. 

2.2 Attack Settings 

Having reviewed the possible consequences of traffic analysis attacks against 
HTTPS, we now examine situations in which an adversary may be motivated 
and capable to learn the types of private details previously discussed. Note that 
all capable adversaries must have at least two abilities. The adversary must be 
able to visit the same webpages as the victim, allowing the adversary to identify 
patterns in encrypted traffic indicative of different webpages. The adversary must 
also be able to observe victim traffic, allowing the adversary to match observed 
traffic with previously learned patterns. 

ISP Snooping ISPs are uniquely well positioned to target and sell ad- 
vertising since they have the most comprehensive view of the consumer. Both 
ISPs |16I17| and commercial chains of wi-fi access points [T5] , have shown efforts 
to mine customer data and/or sell advertising. Traffic analysis vulnerabilities 
would allow ISPs to conduct data mining despite the presence of encryption. 
Separate from electronic ad delivery, access points associated with businesses 
such as cafes and hotels could also deliver ads along with transaction receipts, 
physical mailings, or other special offers. 

Employee Monitoring Employers have the ability to monitor the online 
activities of employees connected to an employer provided network, regardless 
of whether the device in use is a personal or corporate device. This power has 
been abused by extensively monitoring the activities of employees [19) . even 
extending to whistleblowers whose communications are protected by law |20j . 
Traffic analysis would allow employers to remove many of the protections ex- 
pected by employees using HTTPS to protect their sensitive communications 
from untrusted parties. 

Surveillance While revelations of NSA surveillance spanning from social 
media to World of Warcraft are an unwelcome surprise to many |21l22l23l24j . 
other governments around the world have long employed these practices |25l26j . 
When asked about the efficacy of encryption, Snowden maintained "Encryption 
works. Properly implemented strong crypto systems are one of the few things 
that you can rely on. Unfortunately, endpoint security is so terrifically weak 
that NSA can frequently find ways around it" [27] . Despite this assertion, we 
still see NSA surveillance efforts specifically targeting HTTPS [55], indicating 
the value of removing side-channel attacks to ensure that HTTPS is "properly 
implemented." 

Censorship Although the consequences of forbidden internet activity can 
include imprisonment and beyond in some settings, in other contexts broad fil- 
tering efforts have resulted in lower grade punishments designed to deter further 
transgression and encourage self-censorship. For example, Chinese social media 
firm Sina has recently punished more than 100,000 users through account sus- 
pensions and occasional public admonishment for violating the country's "Seven 
Bottom Lines" guidelines for internet use [29]. Similarly, traffic analysis attacks 





Privacy 


Page Set 


Page Set 


Accuracy 






Traffic 


Analysis 


Active 


Author 


Technology 


Scope 


Size 


(%) 


Cache 


Cookies 


Composition 


Primitive 


Content 


Miller 


HTTPS 


Closed 


6396 


89 


On 


Individual 


Single Site 


Packet 


On 


Hintz □ 


SSL proxy 


Closed 


5 


inn 


? 


Individual 


Homepages 


Request 


? 


Sun [2] 


SSL proxy 


Open 


2,000 
100,000 


75 (TP) 
1.5 (FP) 


Off 


Universal 


Single Site 


Request 


Off 


Cheng llOl 


HTTPS 


Closed 


489 


96 


Off 


Individual 


Single Site 


Request 


Off 


Danezis 1 11 1 


HTTPS 


Closed 


? 


89 


n/a 


n/a 


Single Site 


Request 


n/a 


Herrmann 3 


SSH tunnel 


Closed 


775 


97 


Off 


Universal 


Homepages 


Packet 


7 


Cai m 


SSH tunnel 


Closed 


1(10 


92 


Off 


Universal 


Homepages 


Packet 


Scripts 


Dyer El 


SSH tunnel 


Closed 


775 


91 


Off 


Universal 


Homepages 


Packet 


? 


Liberatore |B| SSH tunnel 


Closed 


1000 


75 


Off 


Universal 


Homepages 


Packet 


Flash 


Bissias [7l 


SSH tunnel 


Closed 


100 


23 


? 


Universal 


Homepages 


Packet 


? 


Wang [9] 


Tor 


Open 


100 
1000 


95 (TP) 
.n6 (FP) 


Off 


Universal 


Homepages 


Packet 


Off 


Wang ED 


Tor 


Closed 


1(10 


91 


on 


Universal 


Homepages 


Packet 


Off 


Cai m 


Tor 


Closed 


1(10 


78 


On 


Universal 


Homepages 


Packet 


Scripts 


Cai m 


Tor 


Closed 


■SCO 


70 


Oil 


Universal 


Homepages 


Packet 


Scripts 


Panchenko |8H 


Tor 


Closed 


775 


55 


Oil 


Universal 


Homepages 


Packet 


Off 


Panchenko fsll 


Tor 


Open 


5 

1,000 


56-73 (TP) 
.05-.89 (FP) 


on 


Universal 


Homepages 


Packet 


Off 


Herrmann CO 


Tor 


Closed 


775 


3 


Off 


Universal 


Homepages 


Packet 


? 


Coull [30l 


Anonymous 
Trace 


Open 


50 
100 


49 
.18 


On 


Universal 


Homepages 


NetFlow 


Flash & 
Scripts 



Table 1: Prior works have focused almost exclusively on website homepages ac- 
cessed via proxy. Cheng and Danezis work is preliminary and unpublished. Eval- 
uations for both works parse object sizes from unencrypted traffic or server logs, 
which is not possible for actual encrypted traffic. Note that "?" indicates the 
author did not specify the property; several properties did not apply to Danezis 
as his evaluation used HTTP server logs. All evaluations used Linux with Firefox 
(FF) 2.0-3.6, except for Hintz and Sun (IE5), Cheng (Netscape), Wang (FF10) 
and Miller (FF22). 



could be used to degrade or block service for users suspected of viewing prohib- 
ited content over encrypted connections. 

3 Prior Work 

In this section we review attacks and defenses proposed in prior work, as well 
as the contexts in which work is evaluated. Comparisons with prior work are 
limited since much work has targeted specialized technologies such as Tor. 
Table [TJ presents an overview of prior attacks. The columns are as follows: 

Privacy Technology The encryption or protection mechanism analyzed for 
traffic analysis vulnerability. Note that some authors considered multiple 
mechanisms, and hence appear twice. 

Page Set Scope Closed indicates the evaluation used a fixed set of pages known 
to the attacker in advance. Open indicates the evaluation used traffic from 
pages both of interest and unknown to the attacker. Whereas open conditions 
are appropriate for Tor, closed conditions are appropriate for HTTPS. 



Page Set Size For closed scope, the number of pages used in the evaluation. 
For open scope, the number of pages of interest to the attacker and the 
number of background traffic pages, respectively. 

Accuracy For closed scope, the percent of pages correctly identified. For open 
scope, the true positive (TP) rate of correctly identifying a page as being 
within the censored set and false positive (FP) rate of identifying an uncen- 
sored page as censored. 

Cache Off indicates caching disabled. On indicates default caching behavior. 

Cookies Universal indicates that training and evaluation data were collected 
on the same machine or machines, and consequently with the same cookie 
values. Individual indicates training and evaluation data were collected on 
separate machines with distinct cookie values. 

Traffic Composition Single Site indicates the work identified pages within a 
website or websites. Homepages indicates all pages used in the evaluation 
were the homepages of different websites. 

Analysis Primitive The basic unit on which traffic analysis was conducted. 
Request indicates the analysis operated on the size of each object (e.g. image, 
style sheet, etc.) loaded for each page. Packet indicates meta-data observed 
from TCP packets. NetFlow indicates network traces anonymized using Net- 
Flow. 

Active Content Indicates whether Flash, JavaScript, Java or any other plugins 
were enabled in the browser. 

Several works require discussion in addition to Table [l] Danezis focused on 
the HTTPS context, but evaluated his technique using HTTP server logs at re- 
quest granularity, removing any effects of fragmentation, concurrent connections 
or pipelined requests [IT]. Cheng et al. also focused on HTTPS and conducted 
an evaluation using traffic from an HTTP website intentionally selected for its 
static content [10] . Both works were unpublished, and operated on individual 
object sizes parsed from the unencrypted traffic rather than packet metadata. 
Likewise, the approaches of Sun et al. and Hintz et al. also assume the ability 
to parse entire object sizes from traffic [112] . For these reasons, we compare our 
work to Liberatore and Levine, Panchenko et al. and Wang et al. as these are 
more advanced and recently proposed techniques. 

Herrmann [3] and Cai £Q both conduct small scale preliminary evaluations 
which involve enabling the browser cache. In contrast to our evaluation, these 
evaluations only consider website homepages and all pages are loaded in a fixed, 
round-robin order. Herrmann additionally increases the cache size from the de- 
fault to 2GB, reducing the likelihood of any cache evictions and stabilizing traf- 
fic. With caching enabled, Herrmann and Cai both observe approximately a 5% 
decrease in accuracy for their techniques, and Cai reports slightly improved per- 
formance for the Panchenko classifier. We evaluate the impact of caching on 
pages within the same website, where caching will have a greater effect than on 
the homepages of different websites due to increased page similarity, and load 
pages in a randomized order for greater cache state variation. 

Separate from attacks, we also review prior work relating to traffic analysis 
defense. Dyer et al. conduct a review of low level defenses operating on individual 




Fig. 1: Attack Presentation. The dashed line separates training workflow (above) 
from attack workflow (below). Bubbles indicate the section in which the system 
component is discussed, with §A indicating Appendix A. 



packets [5] . Dyer evaluates defenses using data released by Liberatore and Levine 
and Herrmann et al. which collect traffic from website home pages on a single 
machine with caching disabled. In this context, Dyer finds that low level defenses 
are ineffective against attacks which examine features aggregated over multiple 
packets. For example, the linear and exponential padding defenses, which pad 
packet sizes to multiples of 128 and powers of 2 respectively, reduce the accuracy 
of the Panchenko classifier at most from 97.2% to 96.6%. In our evaluation, which 
considers pages within the same website, enabled caching and identification of 
traces collected on machines separate from the attacker, we find that low level, 
stateless defenses can be considerably more effective than initially indicated by 
Dyer. 

In addition to the packet level defenses evaluated by Dyer, many defenses 
have been proposed which operate at higher levels with additional cost and 
implementation requirements. These include HTTPOS [31 , traffic morphing [55] 
and BuFLO |4l5j . HTTPOS, unlike most defenses, works from the client side 
to perturb the traffic generated by manipulating various features of TCP and 
HTTP to affect packet size, object size, pipelining behavior, packet timing and 
other properties. These manipulations require some degree of coordination and 
support from the server. BuFLO aims to provide provable defense against traffic 
analysis attacks by sending a constant stream of traffic at a fixed packet size for 
a pre-set minimum amount of time. Given the effectiveness and advantages of 
lower level defenses in our evaluation context, we do not further explore these 
higher level approaches in our work. 



4 Attack Presentation 

In this section we present our attack. Figure [T] presents an overview of the attack, 
depicting the anticipated workflow of the attacker as well as the subsections in 
which we discuss his efforts. In section 4.1 we present a formalism for identifying 
and labeling pages within a website and generating a site graph representing 



the website link structure. Section |4.2| presents the core of our classification 
approach: Gaussian clustering techniques that capture standard variations in 
traffic and allow logistic regression to robustly identify key objects which reliably 
differentiate pages. Having generated isolated predictions, we then leverage the 



site graph and sequential nature of the data in section 4.3 with a hidden Markov 
model (HMM) to further improve accuracy. 

Throughout this section we depend on several terms which we define as fol- 
lows: 

Uniform Resource Locator (URL) A character string referencing a specific 
web resource, such as an image or HTML file. 

Webpage The set of resources loaded by a browser in response to the user 
clicking a link or otherwise entering a URL into the browser address bar. 
Two webpages are the same if a user could be reasonably expected to view 
their contents as substantially similar, regardless of the specific URLs fetched 
while loading the webpages or dynamic content such as advertising. 

Sample An instance of the traffic generated when a browser displays a webpage. 

Label A unique identifier assigned to each set of webpages which are the same. 
For example, two webpages which differ only in advertising will receive the 
same label, while webpages differing in core content are assigned different 
labels. Labels are assigned to samples in accordance with the webpage con- 
tained in the sample's traffic. 

Website A set of webpages such that the URLs which cause each webpage to 
load in the browser are hosted at the same domain. 

Site Graph A graph representing the link structure of a website. Nodes corre- 
spond to labels assigned to webpages within the website. Edges correspond 
to links between webpages, represented as the set of labels reachable from a 
given label. 



4.1 Label and Site Graph Generation 

This section presents our approach to labeling and site graph generation. Merely 
treating the URL which causes a webpage to load as the label for the webpage 
is not sufficient for analyzing webpages within the same website. URLs may 
contain arguments, such as session IDs, which do not impact the content of the 
webpage and result in different labels aliasing webpages which are the same. 
This prevents accumulation of sufficient training samples for each label and 
hinders evaluation. URL redirection further complicates labeling; the same URL 
may refer to multiple webpages (e.g. error pages or A/B testing) or multiple 
URLs may refer to the same webpage. We present a labeling solution based on 
URLs and designed to accommodate these challenges.^] When labeling errors are 
inevitable, we prefer to have a single webpage aliased to multiple labels rather 

3 While URL redirection may be implemented within the web server or via JavaScript 
that alters webpage contents, allowing a single URL to represent many webpages, 
this behavior is limited in practice because website designers are motivated to allow 
search engines to link to webpages in search results. 



than have multiple distinct pages aliased to a single label. The former may result 
in lower accuracy ratings, but it allows our attacker to learn correct information. 

Our approach contains two phases. In the first phase, we conduct a prelim- 
inary crawl of the website, yielding many URLs from links encountered during 
the crawl. We then analyze these URLs to produce a canonicalization function 
which, given a URL, returns a canonical label for the webpage loaded as result of 
entering the URL into a browser address barj^We use the canonicalization func- 
tion to produce a preliminary site graph, which guides further crawling activity. 
Our approach proceeds in two phases because the non-deterministic nature of 
URL redirections requires the attacker to conduct extensive crawling to observe 
the full breadth of both URLs and redirections, and crawling can not be con- 
ducted without a basic heuristic identifying URLs which likely alias the same 
webpage. As we describe below our approach allows, but does not require, the 
second crawl to be combined with training data collection. After the second phase 
is complete, both the labels and site graph are refined using the additional URLs 
and redirections observed during the crawl. We present our approach below. 

Execute Preliminary Crawl The first step in developing labels and a site 
graph is to crawl the website. The crawl can be implemented as either a depth- 
or breadth-first search, beginning at the homepage and exploring every link on a 
page up to a fixed maximum depth. We perform a breadth first search to depth 
5. This crawl will produce a graph G = (U,E), where U represents the set of 
URLs seen as links during the crawl, and E = {(u,u') € U x U \ u links to u'} 
represents links between URLs in U. 

Produce Canonicalization Function Since multiple URLs may cause 
webpages which are effectively the same to load when entered into a browser 
address bar, the role of a canonicalization function is to produce a canonical 
label given a URL. The canonicalization function will be of the form C : U — > L, 
where C denotes the canonicalization function, U denotes the initial set of URLs, 
and L denotes the set of labels. To maintain the criterion that we error on the 
side of multiple labels aliasing the same webpage, our approach forces any URLs 
with different paths to be assigned different labels and selectively identifies URL 
arguments that appreciably impact webpage content for inclusion in the label. 
We were able to execute this phase on all websites we surveyed. See Appendix B 
for our full approach, including several heuristics independent of URL arguments 
which further guide canonicalization. 

Canonicalize Initial Graph We use our canonicalization function to pro- 
duce an initial site graph G' = (L,E') where L represents the set of labels on 
the website and E' represents links. We construct E' as follows: 

E> = {(C(u),C(u / ))\(u,u / )eE} (1) 

We define a reverse canonicalization function R : L — > V(U) such that 

R(l) = {ueU\ C(u) = 1} (2) 

Note that V(X) denotes the power set of X, which is the set of all subsets of X. 

4 Note that we label pages based on the final URL after any URL redirection occurs. 



Identify Browsing Sessions The non-deterministic nature of URL redirec- 
tion requires the attacker to observe many redirection examples to finalize the 
site graph and canonicalization function. This process can also be used to collect 
training data. To collect training data and observe URL redirections the attacker 
builds a list of browsing sessions, each consisting of a fixed length sequence of la- 
bels. We fix the length of our browsing sessions to 75 labels. For cache accuracy, 
the attacker builds browsing sessions using a random walk through G' . Since the 
graph structure prevents visiting all nodes evenly, the attacker prioritizes labels 
not yet visited. When the portion of duplicate labels reaches a fixed threshold 
(we used 0.6), the attacker visits the remaining labels regardless of the graph 
link structure until all labels have received at least single visit. This process is 
repeated until the attacker has produced enough browsing sessions to collect the 
desired amount of training data; we collected at least 64 samples of each label 
in total. 

Execute Browsing Sessions To generate traffic samples the attacker se- 
lects a URL u for each label I in a browsing session such that u € R(l) and 
loads u and (all supporting resources) by effectively entering u into a browser 
address bar. The attacker records the value of document . location (once the 
entire webpage is done loading) to identify any URL redirections. V denotes 
the set of final URLs which are observed in document . location. We define a 
new function T : U — > V(U') such that T(u) = {u' e U' \ u resolved at least 
once to u'}. We use this to define a new translation T' : L — > V(U') such that 

T'(l)= |J T(u) (3) 

u£R(l) 

Refine Canonicalization Function Since the set of final URLs U' may 
include arguments which were not present in the original set U, we refine our 
canonicalization function C to produce a new function C : U' — > L' , where 
L' denotes a new set of labels. The refinement is conducted using the same 
techniques as we used to produced C. Samples are labeled as C'{v!) G L' where 
v! denotes the value of document . location when the sample finished loading. 

Refine Site Graph Since the final set of labels V may contain labels which 
are not in L, the attacker must update G' . The update must maintain the prop- 
erty that any sequence of labels l' 0 ,l[,... e V observed during data collection 
must be a valid path in the final graph. Therefore, the attacker defines a new 
graph G" = (W, E") such that E" is defined as 

E" = {(u, u') | u € T'(l) A u 1 e T'(l') V (/, 0 e E'} (4) 

We apply our canonicalization function C to produce a final graph G'" = 
(L', E'") where 

E'" = {(C'(u),C'{u')) V («, u') G E"} (5) 

maintaining the property that any sequence of labels observed during training is 
a valid path in the final graph. Note that our evaluation generates a separate site 
graph for each model, using only redirections which occurred in training data. 
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Table 2: Site graph and canonicalization summary. "Selected Subset" denotes 
the subset of the preliminary site graph which we randomly select for inclusion 
in our evaluation, "Avg. Links" denotes the average number of links per label, 
and "URLs" indicates the number of URLs seen as links in the preliminary site 
graph corresponding to an included label. 



This leaves the possibility of a path in evaluation data which is not valid on the 
attacker site graph, but we did not find this to be an issue in practice. 

For the purposes of this work, we augment the above approach to select a 
subset of the preliminary site graph for further analysis. By surveying a subset 
of each website, we are able to explore additional websites and browser config- 
urations and remain within our resource constraints. We initialize the selected 
subset to include the label corresponding to the homepage, and iteratively ex- 
pand the subset by adding a label reachable from the selected subset via the 
link structure of the preliminary site graph until 500 labels are selected. The 
set of links for the graph subset is defined as any links occurring between the 
500 selected labels. Table [2] presents properties of the preliminary site graph 
G", selected subset, and the final site graph G'" for each of the 10 websites 
we survey. We implement the preliminary crawl using Python and the second 
crawl (i.e. training data collection) using the browsing infrastructure described 
in Appendix A. 

4.2 Feature Extraction and Machine Learning 

This section presents our individual sample classification technique. First, we 
describe the information which we extract from a sample, then we describe 
processing to produce features for machine learning, and finally describe the 
application of the learning technique itself. 

We initially extract traffic burst pairs from each sample. Burst pairs are de- 
fined as the collective size of a contiguous outgoing packet sequence followed by 
the collective size of a contiguous incoming packet sequence. Intuitively, con- 
tiguous outgoing sequences correspond to requests, and contiguous incoming 
sequences correspond to responses. All packets must occur on the same TCP 
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Fig. 2: Table |2c| displays the burst pairs extracted from three hypothetical sam- 



ples. Figures [2a] and 2b show the result of burst pair clustering. Figure [2d] de- 
picts the Bag-of-Gaussians features for each sample, where each feature value is 
defined as the total likelihood of all points from a given sample at the relevant 
domain under the Gaussian distribution corresponding to the feature. Our Gaus- 
sian similarity metric enables our attack to distinguish minor traffic variations 
from significant differences. 



connection to minimize the effects of interleaving traffic. For example denoting 
outgoing packets as positive and incoming packets as negative, the sequence 
[+1420, +310, -1420, -810, +530, -1080] would result in the burst pairs 
[1730, 2230] and [530, 1080] . Analyzing traffic bursts removes any fragmen- 
tation effects. Additionally, treating bursts as pairs allows the data to contain 
minimal ordering information and go beyond techniques which focus purely on 
packet size distributions. 

Once burst pairs are extracted from each TCP connection, the pairs are 
grouped using the second level domain of the host associated with the desti- 
nation IP of the connection. All IPs for which the reverse DNS lookup fails are 
treated as a single "unknown" domain. Pairs from each domain undergo k-means 
clustering to identify commonly occurring and closely related tuples. Since tu- 
ples correspond to individual requests and pipelined series of requests, some 



tuple values will occur on multiple webpages while other tuples will occur only 
on individual webpages. Once clusters are formed we fit a Gaussian distribution 
to each cluster and treat each cluster as a feature dimension, producing our 
hxcd-width feature vector. Features are extracted from samples by computing 
the extent to which each Gaussian is represented in the sample. 

Figure [2] depicts the feature extraction process using a fabricated example 
involving three samples and two domains. Clustering results in five clusters, 
indexed 1-5, for Domain A and three clusters, indexed 6-8, for Domain B. The 
feature vector thus has eight dimensions, with one corresponding to each cluster. 
Sample x has traffic tuples in clusters 1, 2, 5, 6 and 7, but no traffic tuples in 
clusters 3, 4, 8, so its feature vector has non-zero values in dimensions 1, 2, 5, 6, 
7, and zero values in dimensions 3, 4, 8. We create feature vectors for samples + 
and o in a similar fashion. 

We specify our approach formally as follows: 

— Let X denote the entire set of tuples from a trace, with X d C X denoting 
set all tuples observed at domain d. 

— Let Sf, /j,f respectively denote the covariance and mean of Gaussian i at 
domain d. 

— Let F denote all features, with Ff denoting feature i from domain d. 



To determine the best number of Gaussian features for each domain, we train 
models using a range of values of K and then select the best performing model 
for each domain. 

Analogously to the Bag-of- Words document processing technique, our ap- 
proach projects a variable length vector of tuples into a finite dimensional space 
where each dimension "occurs" to some extent in the original sample. Whereas 
occurrence is determined by word count in Bag-of- Words, occurrence in our 
method is determined by Gaussian likelihood. For this reason, we refer to our 
approach as Bag-of-Gaussians (BoG). 

Once Gaussian features have been extracted from each sample the feature 
set is augmented to include counts of packet sizes observed over the entire trace. 
For example, if the lengths of all outgoing and incoming packets are between 1 
and 1500 bytes, we add 3000 additional features where each feature corresponds 
to the total number of packets sent in a specific direction with a specific size. We 
linearly normalize all features to be in the range [0, 1] and train a model using 
L2 regularized multi-class logistic regression with C = 128 using the liblinear 
package [3"3"] . 

4.3 Hidden Markov Model 
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The basic attack presented in section 4.2 classifies each sample independently. 
In practice, samples in a browsing session are not independent since the link 
structure of the website guides the browsing sequence. We leverage this ordering 



information, contained in the site graph produced in section |4.1[ to improve 
results using a hidden Markov model (HMM). Recall that a HMM for a sequence 
of length N is defined by a set of latent variables Z = {z n | 1 < n < N}, a set 
of observed variables X — {x n | 1 < n < N}, transition matrix A such that 
Aij = P(Z n+ i = j\Z n = i), an initial distribution n such that iTj = P(Z\ = j) 
and an emission function E(x n ,z n ) — P(x n \z n ). 

Applied to our context, the HMM is configured as follows: 

— Latent variables z n correspond to labels I' G L' visited by the victim during 
browsing sessions 

— Observed variables x n correspond to observed feature vectors X 

— Initial distribution 7r assigns an equal likelihood to all pages 

— Transition matrix A encodes E'", the set of links between pages in L' , such 
that all links have equal likelihood 

— Emission function E(x n ,z n ) — P(z n \x n ) determined by logistic regression 

After obtaining predictions with logistic regression, the attacker refines the 
predictions using the Viterbi algorithm to solve for the most likely values of the 
latent variables, each of which corresponds to a pageview by the user. 

5 Impact of Evaluation Conditions 

In this section we demonstrate the impact of evaluation conditions on accuracy 
results and traffic characteristics. First, we present the scope and motivation of 
our investigation. Then, we describe the experimental methodology we use to 
determine the impact of evaluation conditions. Finally, we present the results of 
our experiments on four attack implementations. All attacks are impacted by 
the evaluation condition variations we consider, with the most affected attack 
decreasing accuracy from 68% to 50%. We discuss attack accuracy in this section 
only insofar as is necessary to understand the impact of evaluation conditions; 
we defer a full attack evaluation to section [6l 

Cache Configuration The browser cache improves page loading speed by 
storing web resources which were previously loaded, potentially posing two chal- 
lenges to traffic analysis. Providing content locally decreases the total amount 
of traffic, reducing the information available for use in an attack. Additionally, 
differences in browsing history can result in differences in cache contents and 
further vary network traffic. Since privacy tools such as Tor frequently disable 
caching, many prior evaluations have disabled caching as well [31]. But in prac- 
tice, general users of HTTPS typically do not modify default cache settings, so 
we evaluate the impact of enabling caching to default settings. 

User-Specific Cookies If an evaluation collects all data with either the 
same browser instance or repeatedly uses a fresh browser image (such as the 
Tor browser bundle), there are respective assumptions that the attacker and 
victim will either share the same cookies or use none at all. While a traffic 
analysis attacker will not have access to victim cookies, privacy technologies 
which begin all browsing sessions from a clean browsing image effectively share 



the null cookie. We compare the performance of evaluations which use the same 
(non-null) cookie value in all data, different (non-null) cookie values in training 
and evaluation, a null cookie in all data, and evaluations which mix null and 
non-null cookies. 

Pageview Diversity Many evaluations collect data by repeatedly visiting 
a fixed set of URLs from a single machine and dividing the collected data for 
training and evaluation. This approach introduces an unrealistic assumption 
that, during training, an attacker will be able to visit the same set of webpages 
as the victim. Note that this would require collecting separate training data for 
each victim given that each victim visits a unique set of pages. We examine 
the impact of allowing the victim to intersperse browsing of multiple websites, 
including websites outside our attacker's monitoring focusj^j 

Webpage Similarity Since HTTPS will usually allow an eavesdropper to 
learn the domain a user is visiting, our evaluation focuses on efforts to differen- 
tiate individual webpages within a website protected by HTTPS. Differentiating 
webpages within the same website may pose a greater challenge than differentiat- 
ing website homepages. Webpages within the same website share many resources, 
increasing the effect of caching and making webpages within a website harder 
to distinguish. We examine the relative traffic volumes of browsing both website 
homepages and webpages within a website. 

To quantify the impact of evaluation conditions on accuracy results, we design 
four modes for data collection designed to isolate specific effects. Our approach 
assumes that data will be gathered in a series of browsing sessions, where each 
session consists of loading a fixed number of URLs in a browser. The four modes 
are as follows: 

1. Cache disabled, new virtual machine (VM) for each browsing session 

2. Cache enabled, new VM for each browsing session 

3. Cache enabled, persistent VM for all browsing sessions, single site per VM 

4. Cache enabled, persistent VM for all browsing sessions, all sites on same VM 

In our experiments we fixed the session length to 75 URLs and collect at least 
16 samples of each label under each collection mode. We begin each browsing 
session in the first two configurations with a fresh VM image to eliminate the 
possibility of cookie persistence in browser or machine state. The first and second 
modes differ only with respect to cache configuration, allowing us to quantify the 
impact of caching. In effect the second, third and fourth modes each represent a 
distinct cookie value, with the second mode representing a null cookie and the 
third and fourth modes having actual, distinct, cookie values set by the site. The 
third and fourth modes differ in pageview diversity. In the context of HTTPS 
evaluations, the fourth mode most closely reflects the behavior of actual users 

5 Note that this is different from an open-world vs. closed-world distinction as de- 
scribed in section [3] as we assume that the attacker will train a model for each 
website in its entirety and be able to identify the correct model based on traffic des- 
tination. Here, we are concerned with any effects on browser cache or personalized 
content which may impact traffic analysis. 
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Fig. 3: Disabling the cache significantly increases the number of unique packet 
sizes for samples of a given label. For each label I we determine the mean number 
l m of unique packet sizes for samples of I with caching enabled, and normalize 
the unique packet size counts of all samples of label I using l m . We present the 
normalized values for all labels separated by cache configuration. 



and hence serves as evaluation data, while the second and third modes generate 
training data. 

Our analysis reveals that caching significantly decreases the number of unique 
packet sizes observed for samples of a given label. We focus on the number of 
unique packet sizes since packet size counts are a commonly used feature in 
traffic analysis attacks. A reduction in the number of unique packet sizes reduces 
the number of non-zero features and creates difficulty in distinguishing samples. 
Figure [3] contrasts samples from the first and second collection modes, presenting 
the effect of caching on the number of unique packet sizes observed per label for 
each of the 10 websites we evaluate. Note that the figure only reflects TCP data 
packets. We use a normalization process to present average impact of caching on 
a per-label basis across an entire website, allowing us to depict for each website 
the expected change in number of unique packet sizes for any given label as a 
result of disabling the cache. 

Beyond examining traffic characteristics, our analysis shows that factors such 
as caching, user-specific cookies and pageview diversity can cause variations as 
large as 18% in measured attack accuracy. We examine each of these factors 
by training a model using data from a specific collection mode, and comparing 
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Fig. 4: "Train: X, Eval: Y" indicates training data from mode X and evaluation 
data from mode Y as shown in Table |4d| For evaluations which use a privacy 
tool such as the Tor browser bundle and assume a closed world, training and 
evaluating using mode 1 is most realistic. However, in the HTTPS context train- 
ing using mode 2 or 3 and evaluating using mode 4 is most realistic. Figure [4c] 
presents differences as large as 18% between these conditions, demonstrating the 
importance of evaluation conditions when measuring attack accuracy. 



model accuracy when evaluated on a range of collection modes. Since some mod- 
els must be trained and evaluated using data from the same collection mode, we 
use 8 training samples per label and leave the remaining 8 samples for evaluation. 
Figure [4] presents the results of our analysis: 

Cache Effect Figure [4a] compares the performance of models trained and eval- 
uated using mode 1 to models trained and evaluated using mode 2. As these 
modes differ only by enabled caching, we see that caching has moderate 
impact and can influence reported accuracy by as much as 10%. 

Cookie Effect Figure [4b] measures the impact of user-specific cookies by com- 
paring the performance of models trained and evaluated using browsing 
modes 2 and 3. We observe that both the null cookie in mode 2 and the 
user-specific cookie in mode 3 perform 5-10 percentage points better when 
the evaluation data is drawn from the same mode as the training data. This 
suggests that any difference in cookies between training and evaluation con- 
ditions will impact accuracy results. 



Total Effect Figure 4c presents the combined effects of enabled caching, user- 



specific cookies and increased pageview diversity. Recalling Figure 4b , notice 



that models trained using mode 2 perform similarly on modes 3 and 4, and 
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Fig. 5: Decrease in traffic volume caused by browsing webpages internal to a 
website as compared to website homepages. Similar to the effect of caching, 
the decreased traffic volume is likely to increase classification errors. Note that 
packet count ranges are selected to divide internal pages into 5 even ranges. 



models trained using mode 3 perform similarly on modes 2 and 4, confirm- 
ing the importance of user-specific cookies. In total, the combined effect of 
enabled caching, user-specific cookies and pageview diversity can influence 
reported accuracy by as much as 19%. Figure 4b suggests that the effect is 
primarily due to caching and cookies since mode 2 generally performs better 
on mode 4, which includes visits to other websites, than on mode 3, which 
contains traffic from only a single website. 



Since prior works have focused largely on website homepages but HTTPS 
requires identification of webpages within the same website, we present data 
demonstrating a decrease in traffic when browsing webpages within a website. 
Figure [5] presents the results of browsing through the Alexa top 1,000 websites, 
loading the homepage of each site, and then loading nine additional links on 
the site at random with caching enabled. By partitioning the total count of data 
packets transferred in the loading of webpages internal to a website into five equal 
size buckets, we see that there is a clear skew towards homepages generating 
more traffic, reflecting increased content and material for traffic analysis. This 
increase, similar to the increase caused by disabled caching, is likely to increase 
classification errors. 



6 Attack Evaluation 



In this section we evaluate the performance of our attack. We begin by presenting 
the selection of previous techniques for comparison and the implementation of 
each attack. Then, we present the aggregate performance of each attack across 
all 10 websites we consider, the impact of training data on attack accuracy, and 
the performance each attack at each individual website. 

We select the Liberatore and Levine (LL), Panchenko et al. (Pan), and Wang 
et al. attacks for evaluation in addition to the BoG attack. The LL attack offers 
a view of baseline performance achievable from low level packet inspection, ap- 
plying naive Bayes to a feature set consisting of packet size counts [5] . We imple- 
mented the LL attack using the naive Bayes implementation in scikit-learn [35 
The Pan attack extends size count features to include additional features re- 
lated to burst length as measured in both packets and bytes as well as total 
traffic volume [8]. For features aggregated over multiple packets, the Pan attack 
rounds feature values to predetermined intervals. We implement the Pan attack 
using the libsvm |36j implementation of the RBF kernel support vector ma- 
chine with the C and 7 parameters specified by Panchenko. We select the Pan 
attack for comparison to demonstrate the significant benefit of Gaussian similar- 
ity rather than predetermined rounding thresholds. The BoG attack functions as 
described in section [I] We implement the BoG attack using the k-means pack- 
age from sof ia-ml |37| and logistic regression with class probability output from 
liblinear [33], with Numpy [35] performing intermediate computation. 

The Wang attack assumes a fundamentally different approach from LL, Pan 
and BoG based on string edit distance [9] . There are several variants of the Wang 
attack which trade computational cost for accuracy by varying the string edit 
distance function. Wang reports that the best distance function for raw packet 
traces is the Optimal String Alignment Distance (OSAD) originally proposed by 
Cai et al. [I] . Unfortunately, the edit distance must be computed for each pair of 
samples, and OSAD is extremely expensive. Therefore, we implement the Fast 
Levenshtein-Like (FLL) distance^ an alternate edit distance function proposed 
by Wang which runs approximately 3000x faster Q Since Wang demonstrates that 
FLL achieves 46% accuracy operating on raw packet traces, as compared to 74% 
accuracy with OSAD, we view FLL as a rough indicator of the potential of the 
OSAD attack. We implement the Wang - FLL attack using scikit-learn [35] . 

We now examine the performance of each attack implementation. We evalu- 
ate attacks using data collected in mode 4 since this mode is most similar to the 
behavior of actual users. We consider both modes 2 and 3 for training data to 

6 Note that the original attack rounded packet sizes to multiples of 600; we operate 
on raw packet sizes as we found this improves attack accuracy in our evaluation. 

7 OSAD has 0(mn) runtime where m and n represent the length of the strings, 
whereas FLL runs in 0(m + n). Wang et al. report completing an evaluation with 
40 samples of 100 labels each in approximately 7 days of CPU time. Since our eval- 
uation involves 10 websites with approx. 500 distinct labels each and 16 samples of 
each label for training and evaluation, we would require approximately 19 months 
of CPU time (excluding any computation for sections l5| or |7| . 
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Fig. 6: Performance of BoG attack and prior techniques. Figure 
performance of all four attacks as a function of training data. Figure 6b presents 
the accuracy of the BoG attack trained with 16 samples as a function of browsing- 
session length. Note that the BoG attack achieves 89% accuracy as compared to 
60% accuracy with the best prior work. 



avoid any bias introduced by using the same cookies as seen in evaluation data 
or browsing the exact same websites. As shown in Figure |4j mode 2 performs 
slightly better so we train all models using data from mode 2. 

Consistent with prior work, our evaluation finds that accuracy of each attack 
improves with increased training data, as indicated by Figure [6a| Notice that the 
Pan attack is most sensitive to variations in the amount of training data, and the 
BoG attack continues to perform well even at low levels of training data. In some 
cases an attacker may have ample opportunity to collect training data, although 
in other cases the victim website may attempt to actively resist traffic analysis 
attacks by detecting crawling behavior and engaging in cloaking, rate limiting or 
other forms of blocking. These defenses would be particularly feasible for special 
interest, low volume websites where organized, frequent crawling would be hard 
to conceal. 

The BoG attack derives significant performance gains from the application 
of the HMM. Figure 6b presents the BoG attack accuracy as a function of the 
browsing session length. Although we collect browsing sessions which each con- 
tain 75 samples, we simulate shorter browsing sessions by applying the HMM to 
randomly selected subsets of browsing sessions and observing impact on accu- 
racy. At session length 1 the HMM has no effect and the BoG attack achieves 
72% accuracy, representing the improvement over the Pan attack resulting from 
the Gaussian feature extraction. The HMM accounts for the remaining perfor- 
mance improvement from 72% accuracy to 89% accuracy. We achieve most of 
the benefit of the HMM after observing two samples in succession, and the full 
benefit after observing approximately 15 samples. Although any technique which 
assigns a likelihood to each label for each sample could be extended with a HMM, 
applying a HMM requires solving the labeling and site graph challenges which 
we present novel solutions for in section |4j 
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Fig. 7: Accuracy of each attack for each website. Note that the BoG attack per- 
forms the worst at Kaiser Permanente, Mayo Clinic and Netflix, which each have 
approx. 1000 labels in their final site graphs according to Table [2] The increase 
in graph size during finalization suggests potential for improved performance 
through better canonicalization to eliminate labels aliasing the same webpages. 



Although the BoG attack averages 89% accuracy overall, only 4 of the 10 
websites included in evaluation have accuracy below 92%. Figure [7] presents the 
accuracy of each attack at each website. The BoG attack performs the worst at 
Mayo Clinic, Netflix and Kaiser Permanente. Notably, the number of labels in the 
site graphs corresponding to each of these websites approximately doubles during 
the finalization process summarized in Table [2] of section |4j URL redirection 
causes the increase in labels, as new URLs appear whose corresponding labels 
were not included in the preliminary site graph. Some new URLs may have been 
poorly handled by the canonicalization function, resulting in labels which alias 
the same content. Although we collected supplemental data to gather sufficient 
training samples for each label, the increase in number of labels and label aliasing 
behavior degrade measured accuracy for all attacks. 

Despite the success of string edit distance based attacks against Tor, the 
Wang - FLL attack struggles in the HTTPS setting. Wang's evaluation is con- 
fined to Tor, which pads all packets into fixed size cells, and does not effectively 
explore edit distance approaches applied to unpadded traffic. Consistent with 
the unpadded nature of HTTPS, we observe that Wang's attack performs best 
on unpadded traffic in the HTTPS setting. Despite this improvement, the Wang 
- FLL technique may struggle because edit distance treats all unique packet sizes 
as equally dissimilar; for example, 310 byte packets are equally similar to 320 
byte packets and 970 byte packets. Additionally, the application of edit distance 
at the packet level causes large objects sent in multiple packets to have propor- 
tionally large impact on edit distance. This bias may be more apparent in the 
HTTPS context than with website homepages since webpages within the same 
website are more similar than homepages of different websites. Replacing the 
FLL distance metric with OSAD or Damerau-Levenshtein would improve at- 
tack accuracy, although the poor performance of FLL suggests the improvement 
would not justify the cost given the alternative techniques available. 
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Fig. 8: Cost and effectiveness of defense techniques. Figure 8a presents the impact 
of defenses on attack accuracy, and Figure [8b] presents the cost of defenses. The 
Burst defense is a novel proposal in this work, offering a substantial decrease in 
accuracy at a cost comparable to a low level defense. 



7 Defense 

This section presents and evaluates several defense techniques, including our 
novel Burst defense which operates between the application and TCP layers to 
obscure high level features of traffic while minimizing overhead. Figure [H] presents 
the effectiveness and cost of the defenses we consider. We find that evaluation 
context significantly impacts defense performance, as we observe increased ef- 
fectiveness of low level defenses in our evaluation as compared to prior work [5] . 
Additionally, we find that the Burst defense offers significant performance im- 
provements while maintaining many advantages of low level defense techniques. 

We select defenses for evaluation on the combined basis of cost, deployability 
and effectiveness. We select the linear and exponential padding defenses from 
Dyer et al. as they are reasonably effective, have the lowest overhead, and are 
implemented statelessly below the TCP layer. The linear defense pads all packet 
sizes up to multiples of 128, and the exponential defense pads all packet sizes up 
to powers of 2. Stateless implementation at the IP layer allows for easy adoption 
across a wide range of client and server software stacks. Additionally, network 
overhead is limited to minor increases in packet size with no new packets gener- 
ated, keeping costs low in the network and on end hosts. We also introduce the 
fragmentation defense which randomly splits any packet which is smaller than 
the MTU, similar to portions of the strategy adopted by HTTPOS [31 . Fragmen- 
tation offers the deployment advantages of not introducing any additional data 
overhead, as well as being entirely supported by current network protocols and 
hardware. We do not consider defenses such as BuFLO or HTTPOS given their 
complexity, cost and the effectiveness of the alternatives we do consider |5l31j . 

The exponential defense slightly outperforms the linear defense, decreasing 
the accuracy of the Pan attack from 60% to 22% and the BoG attack from 89% 
to 59%. Notice that the exponential defense is much more effective in our evalua- 



tion context than Dyer's context, which focuses on comparing website homepages 
loaded over an SSH tunnel with caching disabled and evaluation traffic collected 
on the same machine as training traffic. The fragmentation defense is extremely 
effective against the LL and Wang - FLL attacks, reducing accuracy to below 
1% for each attack, but less effective against the Pan and BoG attacks. The Pan 
and BoG attacks each perform TCP stream re-assembly, aggregating fragmented 
packets, while LL and Wang - FLL do not. Since TCP stream re-assembly is ex- 
pensive and requires complete access to victim traffic, the fragmentation defense 
may be a superior choice against many adversaries in practice. 

Although the fragmentation, linear and exponential defenses offer the deploy- 
ment advantages of functioning statelessly below the TCP layer, their effective- 
ness is limited. The Burst defense offers greater protection, operating between 
the TCP layer and application layer to pad contiguous bursts of traffic up to 
pre-defined thresholds uniquely determined for each website. The Burst defense 
allows for a natural tradeoff between performance and cost, as fewer thresholds 
will result in greater privacy but at the expense of increased padding. 

Unlike the BoG attack which considers bursts as a tuple, padding by the 
Burst defense is independent in each direction. We determine Burst thresholds 
as shown in Algorithm [T] repeating the algorithm for each direction. We pad 
traffic bursts as shown in Algorithm [2] 



Algorithm 1 Threshold Calculation 

Precondition: bursts is a set containing the length of each burst in a given direction 

in defense training traffic 
Precondition: threshold specifies the maximum allowable cost of the Burst defense 

1: thresholds <— setQ 

2: bucket <— setQ 

3: for b in sorted(bursts) do 

4: inflation <— len(bucket + b) * max(bucket + b) / sum(bucket + b) 

5: if inflation > threshold then 

6: thresholds thresholds + max(bucket) 

7: bucket <s— setQ + b 

8: else 

9: bucket 4— bucket + b 

10: end if 
11: end for 

12: return thresholds + max(bucket) 



We evaluate the Burst defense for threshold values 1.10 and 1.40, with the 
resulting cost and performance shown in Figure [HJ The Burst defense outper- 
forms defenses which operate solely at the packet level by obscuring features 
aggregated over entire TCP streams. Simultaneously, the Burst defense offers 
deployability advantages over techniques such as HTTPOS since the Burst de- 
fense is implemented between the TCP and application layers. The cost of the 
Burst defense compares favorably to defenses such as HTTPOS, BuFLO and 



Algorithm 2 Burst Padding 

Precondition: burst specifies the size of a directed traffic burst 
Precondition: thresholds specifies the thresholds obtained in Algorithm [l] 

1: for t in sorted(thresholds) do 

2: if t > burst then 

3: return t 

4: end if 

5: end for 

6: return burst 



traffic morphing, reported to cost at least 37%, 94% and 50% respectively |4l32j . 
Having demonstrated the performance and favorable cost of the Burst defense, 
we plan to address further comparative evaluation in future work. 

8 Discussion and Conclusion 

This work examines the vulnerability of HTTPS to traffic analysis attacks, focus- 
ing on evaluation methodology, attack and defense. Although we present novel 
contributions in each of these areas, many open problems remain. 

Our examination of evaluation methodology focuses on caching and user- 
specific cookies, but does not explore factors such as browser differences, operat- 
ing system differences, mobile/tablet devices or network location. Each of these 
factors may contribute to traffic diversity in practice, likely degrading attack ac- 
curacy. In the event that differences in browser, operating system or device type 
negatively influence attack results, we suggest that these differences may be han- 
dled by collecting separate training data for each client configuration. We then 
suggest constructing a HMM that contains isolated site graphs for each client 
configuration. During attack execution, the classifier will likely assign higher 
likelihoods to samples from the client configuration matching the actual client, 
and the HMM will likely focus prediction within a single isolated site graph. By 
identifying the correct set of training data for use in prediction, the HMM may 
effectively minimize the challenge posed by multiple client configurations. We 
leave refinement and evaluation of this approach as future work. 

Additional future work remains in the area of attack development. To date, 
all approaches have assumed that the victim browses the web in a single tab 
and that successive page loads can be easily delineated. Future work should 
investigate actual user practice in these areas and impact on analysis results. For 
example, while many users have multiple tabs open at the same time, it is unclear 
how much traffic a tab generates once a page is done loading. Additionally, we 
do not know how easily traffic from separate page loadings may be delineated 
given a contiguous stream of user traffic. Lastly, our work assumes that the 
victim actually adheres to the link structure of the website. In practice, it may 
be possible to accommodate users who do not adhere to the link structure by 
introducing strong and weak transitions rather than a binary transition matrix, 



where strong transitions are assigned high likelihood and represent actual links 
on a website and weak transitions join all unlinked pages and are assigned low 
likelihood. In this way the HMM will allow transitions outside of the site graph 
provided that the classifier issues a very confident prediction. 

Defense development and evaluation also require further exploration. At- 
tack evaluation conditions and defense development are somewhat related since 
conditions which favor attack performance will simultaneously decrease defense 
effectiveness. Defense evaluation under conditions which favor attack creates the 
appearance that defenses must be complex and expensive, effectively discourag- 
ing defense deployment. To increase likelihood of deployment, future work must 
investigate necessary defense measures under increasingly realistic conditions 
since realistic conditions may substantially contribute to effective defense. 

This work has involved substantial implementation, data collection and com- 
putation. To facilitate future comparative work, our data collection infrastruc- 
ture, traffic samples, attack and defense implementations and results are avail- 
able upon request. 
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Appendix A 

In this appendix we describe the software system used to collect traffic sam- 
ples. To allow parallel collection of data, we collected traffic inside a VirtualBox 
4.2.16 VM running Linux 12.04 and Firefox 22. We ran 4 VMs at a time on the 
same workstation with a quad core Xeon processor and I2GB of RAM. We ran 
only 4 VMs to allow each VM sufficient processor, memory, disk and network 
resources to prevent any dropped packets or delays in website loading. Since 
some collection modes require a fresh VM image for each browsing session, we 
disabled automatic updates in Ubuntu and Firefox as these would have consis- 
tently downloaded updates and contaminated traffic samples. 

Figure [9] depicts the software components of the collection infrastructure. The 
HostDriver managed the collection process, including booting VMs and assigning 
workloads. The HostDriver shared a folder with each VM containing a user.js 
file used to configure Firefox caching behavior and list of URLs to visit. To 
disable the Firefox cache, we set 15 caching related configuration options listed in 
Firefox source file all . j s. Each VM launched the GuestDriver script after boot, 
which then launched Firefox using the supplied user . j s configuration file. A 
Greasemonkey script (version 0.9.20) installed in Firefox then successively visited 
each of the listed URLs. After each webpage had fully loaded the Greasemonkey 
script waited 3 seconds to allow for any JavaScript URL redirection. Once any 
redirection had finished, the script waited an additional 4 seconds to ensure all 
traffic generated by the page was collected. Greasemonkey then issued a blocking 
request to a server running locally on the VM which caused the server to stop 
and restart TCPDUMP, thereby creating separate PCAP files for each sample. 
Note that we disabled Content Security Policy as several sites had policies which 
prevented our Greasemonkey script. 



Appendix B 



In this appendix we further describe the techniques we use to produce the canon- 
icalization function. Recall that the canonicalization function transforms the 
URL displayed in the browser address bar after a webpage loads into a label 
identifying the contents of the webpage. 

The most basic heuristics we apply in the canonicalization function handle 
differences in the URL which rarely impact content. We remove the www subdo- 
main at the beginning of the full domain name, convert all URLs to be lower 
case and remove any trailing "/" at the end of the URL. Beyond these, we as- 
sume that any two URLs which differ prior to the query string will correspond 
to webpages which are not the same. This assumption is not inherent to the 
concept of a canonicalization function and could be removed by modifying the 
canonicalization function to also operate on the domain, port and path of a URL. 

Having canonicalized the domain, port and path of the URL, we now identify 
any arguments appearing in the query string which appreciably impact page 
content. We enumerate all arguments which appear in any URL at the website, 
and then for each argument enumerate all values associated with that argument 
and a list of all URLs in which each (argument, value) pair appears. We then 
iterate through arguments to identify arguments that significantly impact page 
content. For each argument, we randomly select 6 URL paths in which the 
argument appears and up to six distinct values of the argument for each URL 
path. Note that the impact of an argument can normally be determined by 
viewing simply a pair of argument values for a single URL path; we consider 
additional samples as described below to provide a margin of safety. 

The decision process for deciding whether a particular argument significantly 
influences content is as follows: If all pages with the same base URL appear the 
same, then the argument does not influence content. If pages with the same base 
URL appear different, and the argument being examined is the only difference in 
the URL, then the argument does influence content. In the case that pages with 
the same base URL do appear different and multiple arguments are different, 
additional investigation is necessary. If removal of the argument causes page 
content to change, then the argument influences page content. Alternately, if 
substitution of alternate argument values causes page content to change, then 
the argument influences page content. Once we have identified all arguments 
which do not impact page content, we canonicalize URLs by removing these 
arguments and their associated values. 

This approach to canonicalization makes several assumptions. The approach 
assumes that the impact of the argument is independent of the URL path. Addi- 
tionally, the approach assumes that the effect of each argument can be observed 
by manipulating that argument independent of any other arguments. To pro- 
vide limited validation of our assumptions, we perform a "safety check" for each 
website which randomly selects labels and compares URLs corresponding to the 
label to verify that page contents are comparable. 



